Informational Divergence Approximations 
to Product Distributions 



Jie Hou and Gerhard Kramer 
Institute for Communications Engineering 
Technische Universität München, 80290 Munich, Germany
Email: {jie.hou, gerhard.kramer}@tum.de






Abstract: The minimum rate needed to accurately approximate a
product distribution in terms of an unnormalized informational
divergence is shown to be a mutual information. The result
follows by evaluating nonasymptotic results of Hayashi. An
alternative and direct proof is given that extends to cases
where the source distribution is unknown but the source entropy
is known.

I. Introduction 

What is the minimal rate needed to generate a good approximation
of a target distribution with respect to some distance measure?
For example, to learn a system response, we might give inputs to
the system and compute the output statistics. However, in computer
simulations the inputs are only approximations of the true
distributions that are generated with random number generators.
We would like to use a small number of bits to generate good
approximations of a target distribution.

Wyner considered such a problem and characterized the smallest
rate needed to approximate a product distribution accurately when
using the normalized informational divergence as the distance
measure between two distributions. The smallest rate is a Shannon
mutual information [1]. Han-Verdú [2] showed that the same rate is
necessary and sufficient to generate distributions arbitrarily
close to an information stable distribution in terms of
variational distance. Note that neither the normalized
informational divergence nor the variational distance is
necessarily larger or smaller than the other. Hayashi [3] studied
resolvability using the unnormalized informational divergence and
derived results for non-asymptotic cases that can be extended to
asymptotic cases.

We show that the minimal rate needed to make the unnormalized
informational divergence between a target product distribution
and the approximating distribution arbitrarily small is the same
Shannon mutual information as in [1], [2]. This result also
follows from [3, Lemma 2] although it was not stated in [3]. Thus,
our main contributions might be considered to be an alternative
proof to that in [3], and an extension of the proof to cases where
the encoder has a non-uniform input distribution. In any case, we
emphasize that the result implies the results in [1] and [2] when
restricting attention to product distributions (in particular
Theorem 6.3 in [1] and Theorem 4 in [2]).

The paper is organized as follows. In Section II we state
the problem. In Section III we state and prove the main result.
Section IV discusses extensions.



[Figure: $W = \{1, \ldots, M\}$ → Encoder → $U^n$ → $Q^n_{V|U}$ → $V^n$]

Fig. 1. Coding problem with the goal of making $P_{V^n} \approx Q_V^n$.



II. Preliminaries 

Random variables are written with upper case letters and their
realizations with the corresponding lower case letters.
Superscripts denote finite-length sequences of variables/symbols,
e.g., $X^n = X_1, \ldots, X_n$. Subscripts denote the position of a
variable/symbol in a sequence. For instance, $X_i$ denotes the
$i$-th variable in $X^n$. A random variable $X$ has probability
distribution $P_X$ and the support of $P_X$ is denoted as
$\mathrm{supp}(P_X)$. We write probabilities with subscripts
$P_X(x)$ but we drop the subscripts if the arguments of the
distribution are lower case versions of the random variables. For
example, we write $P(x) = P_X(x)$. If the $X_i$, $i = 1, \ldots, n$,
are independent and identically distributed (i.i.d.) according to
$P_X$, then we have $P(x^n) = \prod_{i=1}^{n} P_X(x_i)$ and we write
$P_{X^n} = P_X^n$. Calligraphic letters denote sets. The size of a
set $\mathcal{S}$ is denoted as $|\mathcal{S}|$. We use
$\mathcal{T}_\epsilon^n(P_X)$ to denote the set of letter-typical
sequences of length $n$ with respect to the probability
distribution $P_X$ and the non-negative number $\epsilon$ (see
Ch. 3 of the cited reference), i.e., we have

$$\mathcal{T}_\epsilon^n(P_X) = \left\{ x^n : \left| \frac{N(a|x^n)}{n} - P_X(a) \right| \le \epsilon P_X(a), \ \forall a \in \mathcal{X} \right\}$$

where $N(a|x^n)$ is the number of occurrences of $a$ in $x^n$.
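The typicality test above can be checked directly on short sequences. The following is a minimal sketch, assuming the non-strict inequality in the definition; the function name `is_letter_typical` and the example distribution are illustrative and not part of the paper.

```python
from collections import Counter

def is_letter_typical(seq, pmf, eps):
    """Return True if |N(a|seq)/n - P(a)| <= eps * P(a) for every letter a."""
    n = len(seq)
    counts = Counter(seq)
    # A symbol outside the alphabet has P(a) = 0, so any occurrence violates the test.
    if any(a not in pmf for a in counts):
        return False
    return all(abs(counts.get(a, 0) / n - p) <= eps * p for a, p in pmf.items())

# Hypothetical binary source with P(0) = 0.75, P(1) = 0.25.
pmf = {0: 0.75, 1: 0.25}
typical = is_letter_typical([0, 0, 0, 1], pmf, 0.1)      # empirical frequencies match P exactly
not_typical = is_letter_typical([0, 0, 1, 1], pmf, 0.1)  # frequency of 1 is 0.5, far from 0.25
print(typical, not_typical)
```

Note that the deviation allowed for each letter scales with $P_X(a)$, so letters of probability zero may not occur at all in a typical sequence.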

Consider the system depicted in Fig. 1. The random variable $W$
with cardinality $M = 2^{nR}$ is uniformly distributed over
$\{1, \ldots, M\}$ and is encoded to sequences $U^n$. $V^n$ is
generated from $U^n$ through a memoryless channel $Q^n_{V|U}$ and
has distribution $P_{V^n}$. A rate $R$ is achievable if for any
$\delta > 0$ there is a sufficiently large $n$ and an encoder such
that the informational divergence

$$D(P_{V^n} \| Q_V^n) = \sum_{v^n \in \mathrm{supp}(P_{V^n})} P(v^n) \log \frac{P(v^n)}{Q_V^n(v^n)} \tag{1}$$

is less than $\delta$. We wish to determine the smallest achievable
rate.
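For small block lengths the quantity that must be driven below $\delta$ can be evaluated by brute force. The sketch below assumes a binary symmetric channel with crossover probability 0.1 and a uniform target $Q_V$; these choices and the helper names are illustrative assumptions, not taken from the paper.

```python
import itertools
import math

def output_distribution(codebook, channel, out_alphabet):
    """P(v^n) for a uniform message over the codebook and a memoryless channel."""
    n, M = len(codebook[0]), len(codebook)
    return {
        v: sum(math.prod(channel[u_i][v_i] for u_i, v_i in zip(u, v)) for u in codebook) / M
        for v in itertools.product(out_alphabet, repeat=n)
    }

def divergence_bits(p_vn, q_v):
    """Unnormalized D(P_{V^n} || Q_V^n) in bits, with Q_V^n the n-fold product of q_v."""
    total = 0.0
    for v, p in p_vn.items():
        if p > 0:
            q = math.prod(q_v[v_i] for v_i in v)
            if q == 0:
                return math.inf  # support of P is not contained in the support of Q
            total += p * math.log2(p / q)
    return total

# Hypothetical example: BSC(0.1), uniform target distribution, block length n = 2.
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
q_v = {0: 0.5, 1: 0.5}
full = list(itertools.product((0, 1), repeat=2))  # all four inputs, rate R = 1
single = [(0, 0)]                                 # a single codeword, rate R = 0
d_full = divergence_bits(output_distribution(full, bsc, (0, 1)), q_v)
d_single = divergence_bits(output_distribution(single, bsc, (0, 1)), q_v)
print(d_full, d_single)
```

With all inputs in the codebook the output is exactly uniform and the divergence vanishes; with a single codeword the divergence equals $n(1 - H_2(0.1))$ bits for this channel.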

III. ACHIEVABILITY 

Theorem 1: For a given target distribution $Q_V$, the rate $R$
is achievable if $R > I(V; U)$, where $I(V; U)$ is calculated
with some joint distribution $Q_{UV}$ that has marginal $Q_V$
and $|\mathrm{supp}(Q_U)| \le |\mathcal{V}|$. The rate $R$ is not
achievable if $R < I(V; U)$ for all $Q_{UV}$ with
$|\mathrm{supp}(Q_U)| \le |\mathcal{V}|$.
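The threshold in Theorem 1 is a mutual information computed from a joint distribution $Q_{UV}$. As a minimal sketch, the function below evaluates $I(V;U)$ in bits for a joint pmf stored as a dictionary; the binary symmetric example is an assumption made for illustration only.

```python
import math

def mutual_information(q_uv):
    """I(V;U) in bits for a joint pmf given as {(u, v): probability}."""
    q_u, q_v = {}, {}
    for (u, v), p in q_uv.items():
        q_u[u] = q_u.get(u, 0.0) + p
        q_v[v] = q_v.get(v, 0.0) + p
    return sum(
        p * math.log2(p / (q_u[u] * q_v[v]))
        for (u, v), p in q_uv.items()
        if p > 0
    )

# Hypothetical joint distribution: uniform U through a BSC with crossover 0.1.
q_uv = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
i_vu = mutual_information(q_uv)
print(i_vu)  # equals 1 - H_2(0.1) for this example
```

For this example any rate above $1 - H_2(0.1)$ bits per symbol would be achievable according to the theorem, since the marginal of $V$ is the uniform target.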

Proof: Suppose $U$ and $V$ have finite alphabets $\mathcal{U}$ and
$\mathcal{V}$, respectively. Let $Q_{UV}$ be a probability
distribution with marginals $Q_U$ and $Q_V$. Let
$U^n V^n \sim Q^n_{UV}$, i.e., for any $u^n \in \mathcal{U}^n$,
$v^n \in \mathcal{V}^n$ we have

$$Q(u^n, v^n) = \prod_{i=1}^{n} Q_{UV}(u_i, v_i) = Q^n_{UV}(u^n, v^n) \tag{2}$$

$$Q(u^n) = \prod_{i=1}^{n} Q_U(u_i) = Q^n_U(u^n) \tag{3}$$

$$Q(v^n) = \prod_{i=1}^{n} Q_V(v_i) = Q^n_V(v^n) \tag{4}$$

$$Q(v^n|u^n) = \prod_{i=1}^{n} Q_{V|U}(v_i|u_i) = Q^n_{V|U}(v^n|u^n). \tag{5}$$

Let $\mathcal{C} = \{U^n(w)\}_{w=1}^{M}$, where the $U^n(w)$,
$w = 1, \ldots, M$, are generated in an i.i.d. manner using $Q^n_U$.
$V^n$ is generated from $U^n(W)$ through the channel $Q^n_{V|U}$
(see Fig. 2). We have

$$P(v^n) = \sum_{w=1}^{M} \frac{1}{M} Q^n_{V|U}(v^n|U^n(w)). \tag{6}$$

Note that if for a $v^n$ we have

$$Q^n_V(v^n) = \sum_{u^n \in \mathrm{supp}(Q^n_U)} Q^n_U(u^n) Q^n_{V|U}(v^n|u^n) = 0 \tag{7}$$

then we have

$$Q^n_{V|U}(v^n|u^n) = 0, \text{ for all } u^n \in \mathrm{supp}(Q^n_U). \tag{8}$$

This means $P(v^n) = 0$ and
$\mathrm{supp}(P_{V^n}) \subseteq \mathrm{supp}(Q^n_V)$ so that
$D(P_{V^n} \| Q^n_V) < \infty$. We further have

$$E\left[\frac{Q^n_{V|U}(v^n|U^n)}{Q^n_V(v^n)}\right] = \sum_{u^n} Q^n_U(u^n) \frac{Q^n_{V|U}(v^n|u^n)}{Q^n_V(v^n)} = 1. \tag{9}$$

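The average informational divergence over all codebooks
$\mathcal{C}$ is (recall that $P(w) = \frac{1}{M}$, $w = 1, \ldots, M$):

$$\begin{aligned}
E[D(P_{V^n} \| Q^n_V)]
&\overset{(a)}{=} E\left[\log \frac{\sum_{j=1}^{M} Q^n_{V|U}(V^n|U^n(j))}{M\, Q^n_V(V^n)}\right] \\
&= \sum_{w} \frac{1}{M} E\left[\log \frac{\sum_{j=1}^{M} Q^n_{V|U}(V^n|U^n(j))}{M\, Q^n_V(V^n)} \,\middle|\, W = w\right] \\
&\overset{(b)}{\le} \sum_{w} \frac{1}{M} E\left[\log\left(\frac{Q^n_{V|U}(V^n|U^n(w))}{M\, Q^n_V(V^n)} + \frac{M-1}{M}\right) \middle|\, W = w\right] \\
&\le \sum_{w} \frac{1}{M} E\left[\log\left(\frac{Q^n_{V|U}(V^n|U^n(w))}{M\, Q^n_V(V^n)} + 1\right) \middle|\, W = w\right] \\
&\overset{(c)}{=} E\left[\log\left(\frac{Q^n_{V|U}(V^n|U^n)}{M\, Q^n_V(V^n)} + 1\right)\right]
\end{aligned} \tag{10}$$

where

(a) follows by taking the expectation over
$W, U^n(1), \ldots, U^n(M), V^n$;

(b) follows by the concavity of the logarithm and Jensen's
inequality applied to the expectation over the $U^n(j)$, $j \ne w$,
and by using (9);

(c) follows by choosing $U^n V^n \sim Q^n_{UV}$.

[Figure: the message $W$ selects the codeword $U^n(W)$ from the codebook $U^n(1), U^n(2), \ldots, U^n(M)$; $U^n(W)$ is passed through $Q^n_{V|U}$ to produce $V^n$.]

Fig. 2. The random coding experiment.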
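The random coding experiment can be imitated numerically for small $n$. The sketch below is an illustration under stated assumptions (a binary symmetric channel with crossover 0.1, uniform $Q_U$, block length 6); it estimates the codebook-averaged divergence by Monte Carlo and evaluates the final expression in (10) exactly by enumeration. The function names are hypothetical.

```python
import itertools
import math
import random

def output_marginal(q_u, channel):
    """Q_V(v) = sum_u Q_U(u) Q_{V|U}(v|u)."""
    outputs = next(iter(channel.values())).keys()
    return {v: sum(q_u[u] * channel[u][v] for u in q_u) for v in outputs}

def bound_10(q_u, channel, n, M):
    """Exact value of E[log2(Q(V^n|U^n) / (M Q_V^n(V^n)) + 1)] under Q_UV^n."""
    q_v = output_marginal(q_u, channel)
    total = 0.0
    for u_seq in itertools.product(q_u, repeat=n):
        p_u = math.prod(q_u[u] for u in u_seq)
        for v_seq in itertools.product(q_v, repeat=n):
            cond = math.prod(channel[u][v] for u, v in zip(u_seq, v_seq))
            if p_u * cond > 0:
                q = math.prod(q_v[v] for v in v_seq)
                total += p_u * cond * math.log2(cond / (M * q) + 1)
    return total

def average_divergence(q_u, channel, n, M, trials, rng):
    """Monte Carlo estimate of E[D(P_{V^n} || Q_V^n)] over i.i.d. random codebooks."""
    q_v = output_marginal(q_u, channel)
    letters, weights = list(q_u), list(q_u.values())
    acc = 0.0
    for _ in range(trials):
        codebook = [rng.choices(letters, weights, k=n) for _ in range(M)]
        for v_seq in itertools.product(q_v, repeat=n):
            p = sum(math.prod(channel[u][v] for u, v in zip(cw, v_seq)) for cw in codebook) / M
            if p > 0:
                acc += p * math.log2(p / math.prod(q_v[v] for v in v_seq))
    return acc / trials

# Hypothetical setup: BSC(0.1) with uniform inputs.
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
q_u = {0: 0.5, 1: 0.5}
rng = random.Random(0)
for M in (2, 8, 32):
    print(M, average_divergence(q_u, bsc, 6, M, 50, rng), bound_10(q_u, bsc, 6, M))
```

The bound is monotonically decreasing in $M$ for fixed $n$, which is the mechanism the remainder of the proof exploits by letting $M = 2^{nR}$ grow faster than $2^{nI(V;U)}$.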
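Alternatively, we can make the steps (10) more explicit:

$$\begin{aligned}
E[D(P_{V^n} \| Q^n_V)]
&= \sum_{u^n(1)} \cdots \sum_{u^n(M)} \prod_{k=1}^{M} Q^n_U(u^n(k)) \sum_{v^n} \sum_{w=1}^{M} \frac{1}{M} Q^n_{V|U}(v^n|u^n(w)) \log \frac{\sum_{j=1}^{M} Q^n_{V|U}(v^n|u^n(j))}{M\, Q^n_V(v^n)} \\
&\le \sum_{w=1}^{M} \frac{1}{M} \sum_{u^n(w)} \sum_{v^n} Q^n_{UV}(u^n(w), v^n) \log\left(\frac{Q^n_{V|U}(v^n|u^n(w))}{M\, Q^n_V(v^n)} + \frac{M-1}{M}\right) \\
&\le \sum_{u^n, v^n} Q^n_{UV}(u^n, v^n) \log\left(\frac{Q^n_{V|U}(v^n|u^n)}{M\, Q^n_V(v^n)} + 1\right).
\end{aligned} \tag{11}$$

We may write (10) or (11) as

$$E[D(P_{V^n} \| Q^n_V)] \le d_1 + d_2 \tag{12}$$

where

$$d_1 = \sum_{(u^n, v^n) \in \mathcal{T}_\epsilon^n(Q_{UV})} Q^n_{UV}(u^n, v^n) \log\left(\frac{Q^n_{V|U}(v^n|u^n)}{M\, Q^n_V(v^n)} + 1\right)$$

$$d_2 = \sum_{\substack{(u^n, v^n) \notin \mathcal{T}_\epsilon^n(Q_{UV}) \\ (u^n, v^n) \in \mathrm{supp}(Q^n_{UV})}} Q^n_{UV}(u^n, v^n) \log\left(\frac{Q^n_{V|U}(v^n|u^n)}{M\, Q^n_V(v^n)} + 1\right).$$

Using standard inequalities (see [5]) we have

$$\begin{aligned}
d_1 &\le \sum_{(u^n, v^n) \in \mathcal{T}_\epsilon^n(Q_{UV})} Q^n_{UV}(u^n, v^n) \log\left(\frac{2^{-n(H(V|U) - \epsilon)}}{M \cdot 2^{-n(H(V) + \epsilon)}} + 1\right) \\
&\le \log\left(\frac{2^{-n(H(V|U) - \epsilon)}}{M \cdot 2^{-n(H(V) + \epsilon)}} + 1\right) \\
&= \log\left(2^{-n(R - I(V;U) - 2\epsilon)} + 1\right) \\
&\le \log(e) \cdot 2^{-n(R - I(V;U) - 2\epsilon)}
\end{aligned} \tag{13}$$

and $d_1 \to 0$ if $R > I(V; U) + 2\epsilon$ and $n \to \infty$. We further
have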
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< 2|V| • |W| • e~ 2ne2fl iy log 
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where 

$$\mu_V = \min_{v \in \mathrm{supp}(Q_V)} Q(v), \qquad \mu_{UV} = \min_{(u,v) \in \mathrm{supp}(Q_{UV})} Q(u, v).$$

If < i we have 

d 2 < 2|V| • |W| . e - 2ne2 ^v - log 2 
and do — > as n — > oo. If ^XML > 1 we have 

di < 2|V| • |W| • e - 2 " e2 ^v- . n . i g + 

and $d_2 \to 0$ as $n \to \infty$.

Combining the above we have

$$E[D(P_{V^n} \| Q^n_V)] \to 0 \tag{20}$$

if $R > I(V; U) + 2\epsilon$ and $n \to \infty$. As usual, (20) means
that there must exist a code with $D(P_{V^n} \| Q^n_V) < \delta$ for any
$\delta > 0$ and sufficiently large $n$. This proves the coding theorem.
The converse follows from [1, Theorem 5.2] by removing the
normalization factor $\frac{1}{n}$. ■
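The last step in the bound on $d_1$ uses $\log(1 + x) \le x \log(e)$, which makes the exponential decay in $n$ explicit. The following sketch evaluates both sides in bits for a few block lengths; the numerical values of the rate, mutual information and $\epsilon$ are arbitrary assumptions chosen so that $R > I(V;U) + 2\epsilon$.

```python
import math

def d1_bounds(n, rate, mutual_info, eps):
    """Return (log2(1 + x), log2(e) * x) with x = 2^{-n (R - I - 2 eps)}."""
    x = 2.0 ** (-n * (rate - mutual_info - 2 * eps))
    exact = math.log1p(x) / math.log(2)  # log2(1 + x), computed accurately for small x
    relaxed = math.log2(math.e) * x      # the simpler exponential upper bound
    return exact, relaxed

# Hypothetical parameters: R = 0.7, I(V;U) = 0.531, eps = 0.02 (bits per symbol).
results = {n: d1_bounds(n, rate=0.7, mutual_info=0.531, eps=0.02) for n in (10, 50, 100, 200)}
for n, (exact, relaxed) in results.items():
    print(n, exact, relaxed)
```

Both quantities shrink by a constant factor per added symbol, so the bound tends to zero without any normalization by $n$.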

Remark 1: Theorem 1 is proved only for discrete and finite 
random variables. However, extensions to continuous random 
variables should be possible. 

Remark 2: The cardinality bound on $\mathrm{supp}(Q_U)$ can be
derived using techniques from [6, Ch. 15].

Remark 3: If V = U, then we have R > H(V). 

Theorem 1 is proven using a uniform $W$ which represents
strings of uniform bits. If we use a non-uniform $W$ for
the coding scheme in Theorem 1, can we still drive the
unnormalized informational divergence to zero? We give the
answer in the following lemma.

Lemma 1: Let $W = B^{nR}$ be a bit stream with $nR$ bits
that are generated i.i.d. with a binary distribution $P_X$ with
$P_X(0) = p$, $0 < p \le \frac{1}{2}$. The rate $R$ is achievable if

$$R > \frac{I(V; U)}{H_2(p)} \tag{21}$$

where $H_2(\cdot)$ is the binary entropy function.

Proof: The proof is given in the Appendix. ■
Remark 4: Lemma 1 states that even if W is not uniformly 
distributed, the informational divergence can be made small. 
This is useful because if the distribution of W is not known 
exactly, then we can choose R large enough to guarantee the 
desired resolvability result. 
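The rate penalty in Lemma 1 is easy to quantify. As a small sketch, the helper below computes the threshold $I(V;U)/H_2(p)$ from (21); the function names and the numerical inputs are illustrative assumptions.

```python
import math

def binary_entropy(p):
    """H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_threshold(mutual_info, p):
    """Threshold I(V;U) / H_2(p) from (21) for i.i.d. input bits with P_X(0) = p."""
    return mutual_info / binary_entropy(p)

# Hypothetical value I(V;U) = 0.5 bits per symbol.
uniform_bits = rate_threshold(0.5, 0.5)  # uniform bits recover the threshold of Theorem 1
biased_bits = rate_threshold(0.5, 0.11)  # H_2(0.11) is about 0.5, so the threshold roughly doubles
print(uniform_bits, biased_bits)
```

If only a lower bound on the entropy of the input bits is known, the same function evaluated at that entropy gives a rate that is sufficient, which is the use case described in Remark 4.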

IV. Discussion 

Hayashi studied the resolvability problem using unnormalized
divergence and he derived bounds for nonasymptotic
cases [3, Lemma 2]. Theorem 1 can be derived by extending
[3, Lemma 2] to asymptotic cases and it seems that such
a result was the underlying motivation for [3, Lemma 2].
Unfortunately, Theorem 1 is not stated explicitly in [3] and
the ensuing asymptotic analysis was done for normalized
informational divergence. Hayashi's proofs (he developed two
approaches) were based on Shannon random coding.

Theorem 1 implies [1, Theorem 6.3] which states that for
$R > I(V; U)$ the normalized divergence
$\frac{1}{n} D(P_{V^n} \| Q^n_V)$ can be made small. Theorem 1
implies [2, Theorem 4] for product distributions through
Pinsker's inequality [7, Lemma 11.6.1]

$$D(P_X \| Q_X) \ge \frac{1}{2 \ln 2} \| P_X - Q_X \|_{TV}^2 \tag{22}$$

where

$$\| P_X - Q_X \|_{TV} = \sum_{x} | P(x) - Q(x) |. \tag{23}$$
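Pinsker's inequality in the form (22), with divergence measured in bits and the distance defined as in (23), can be spot-checked numerically. The sketch below draws random probability vectors; the alphabet size, sample count and helper names are arbitrary assumptions.

```python
import math
import random

def kl_bits(p, q):
    """D(P || Q) in bits for probability vectors p and q with q > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """Sum of absolute differences, as in (23)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_pmf(k, rng):
    """Random strictly positive probability vector of length k."""
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
gaps = []
for _ in range(1000):
    p, q = random_pmf(4, rng), random_pmf(4, rng)
    # Left side of (22) minus right side; Pinsker says this is non-negative.
    gaps.append(kl_bits(p, q) - l1_distance(p, q) ** 2 / (2 * math.log(2)))
print(min(gaps))
```

Because the divergence upper-bounds a function of the distance, driving the unnormalized divergence to zero forces the distance in (23) to zero as well.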



Moreover, the speed of decay in (13), (18) and (19) is (almost)
exponential with $n$. We can thus make

$$\alpha(n) \cdot E[D(P_{V^n} \| Q^n_V)] \tag{24}$$

vanishingly small as $n \to \infty$, where $\alpha(n)$ represents a
sub-exponential function of $n$ that satisfies

$$\lim_{n \to \infty} \frac{\alpha(n)}{e^{\beta n}} = 0 \tag{25}$$

where $\beta$ is positive and independent of $n$ (see also [3]). For
example, we may choose $\alpha(n) = n^m$ for any integer $m$.

Since all achievability results in [8] are based on [2, Theorem 4],
Theorem 1 extends the results in [8] as well. Theorem 1 is further
closely related to strong secrecy [5] and provides a simple proof
that Shannon random coding suffices to drive an unnormalized mutual
information between messages and eavesdropper observations to zero.

Theorem 1 is valid for approximating product distributions
only. However, extensions to a broader class of distributions,
e.g., information stable distributions [2], are clearly possible.



Finally, an example code is as follows (courtesy of F.
Kschischang). Consider a channel whose input and output
alphabets are the $2^7$ binary 7-tuples. Suppose the channel maps
each input uniformly to a 7-tuple that is distance 0 or 1 away,
i.e., there are 8 channel transitions for every input and each
transition has probability $\frac{1}{8}$. A simple "modulation" code
for this channel is the (7, 4) Hamming code. The code is perfect
and if we choose each codeword with probability $\frac{1}{16}$, then
the output $V^7$ of the channel is uniformly distributed over all
$2^7$ values. Hence $I(V; U) = 4$ bits suffice to "approximate" the
product distribution (here there is no approximation).
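The Hamming code example can be verified by enumeration. The sketch below builds the (7, 4) Hamming code from the standard parity-check description in which column $i$ of the check matrix is the binary representation of $i$ (one common convention, assumed here), and then computes the exact output distribution with rational arithmetic.

```python
from fractions import Fraction
from functools import reduce
import itertools

def syndrome(word):
    """XOR of the 1-based positions holding a 1; it is zero exactly for codewords."""
    return reduce(lambda acc, pos: acc ^ pos, (i + 1 for i, b in enumerate(word) if b), 0)

# All binary 7-tuples with zero syndrome form the (7, 4) Hamming code.
codewords = [w for w in itertools.product((0, 1), repeat=7) if syndrome(w) == 0]

def neighbors(word):
    """The 8 tuples at Hamming distance 0 or 1 from word."""
    yield word
    for i in range(7):
        yield word[:i] + (1 - word[i],) + word[i + 1:]

# Each codeword has probability 1/16 and each channel transition probability 1/8.
output = {}
for cw in codewords:
    for v in neighbors(cw):
        output[v] = output.get(v, Fraction(0)) + Fraction(1, 16) * Fraction(1, 8)

print(len(codewords), len(output), set(output.values()))
```

Since the radius-1 spheres around the 16 codewords are disjoint and cover all 128 tuples, every output has probability exactly $\frac{1}{128}$ and the divergence to the uniform product distribution is zero.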

Appendix A 
Non-Uniform W 

Observe that $H(W) = H(B^{nR}) = nR \cdot H_2(p)$. Following
the same steps as in (10) we have
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d!= P H E Quv(u n (w),V n ) 

weT™(Pj£) (u™(w),v")£T?(QZ v ) 
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We can bound d\ to ^3 as follows 



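$$\begin{aligned}
d_1 &\le \sum_{w \in \mathcal{T}_\epsilon^{nR}(P_X)} P(w) \log\left(\frac{2^{n(I(V;U) + 2\epsilon)}}{2^{n(R \cdot H_2(p) - \epsilon)}} + 1\right) \\
&\le \log\left(2^{-n(R \cdot H_2(p) - I(V;U) - 3\epsilon)} + 1\right) \\
&\le \log(e) \cdot 2^{-n(R \cdot H_2(p) - I(V;U) - 3\epsilon)}
\end{aligned}$$

which goes to zero if $R > \frac{I(V;U) + 3\epsilon}{H_2(p)}$ and
$n \to \infty$. We also have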

d 2 < Yl P H E Quv{u n {w),v n ) 

'(1 -P) ■ "v\u s 



log 



< 2|V| • \U\ ■ e - 2ne2 ^ log 
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which goes to zero as n — > 00 (see (TBI). We further have 

d 3 < E p H E Ow(« n w,B n ) 



' / / (l-p)-v v \u\ n . , 



< 2 P(w) 

< 4 • e - 2ne2p2 log- 



log 
v v\u 



(1 -p) ■ vy\i 
+ 1 



(30) 



which goes to zero as $n \to \infty$ (see (18) and (19)).
Combining the above for non-uniform $W$ we have

$$E[D(P_{V^n} \| Q^n_V)] \to 0 \tag{31}$$

if $R > \frac{I(V;U) + 3\epsilon}{H_2(p)}$ and $n \to \infty$.